374 research outputs found

    A Survey of FPGA Optimization Methods for Data Center Energy Efficiency

    Get PDF
    This article provides a survey of academic literature about field programmable gate array (FPGA) and their utilization for energy efficiency acceleration in data centers. The goal is to critically present the existing FPGA energy optimization techniques and discuss how they can be applied to such systems. To do so, the article explores current energy trends and their projection to the future with particular attention to the requirements set out by the European Code of Conduct for Data Center Energy Efficiency. The article then proposes a complete analysis of over ten years of research in energy optimization techniques, classifying them by purpose, method of application, and impacts on the sources of consumption. Finally, we conclude with the challenges and possible innovations we expect for this sector.Comment: Accepted for publication in IEEE Transactions on Sustainable Computin

    Platform-Aware FPGA System Architecture Generation based on MLIR

    Full text link
    FPGA acceleration is becoming increasingly important to meet the performance demands of modern computing, particularly in big data or machine learning applications. As such, significant effort is being put into the optimization of the hardware accelerators. However, integrating accelerators into modern FPGA platforms, with key features such as high bandwidth memory (HBM), requires manual effort from a platform expert for every new application. We propose the Olympus multi-level intermediate representation (MLIR) dialect and Olympus-opt, a series of analysis and transformation passes on this dialect, for representing and optimizing platform aware system level FPGA architectures. By leveraging MLIR, our automation will be extensible and reusable both between many sources of input and many platform-specific back-ends.Comment: Accepted for presentation at the CPS workshop 2023 (http://www.cpsschool.eu/cps-workshop

    Enabling Automated Bug Detection for IP-based Designs using High-Level Synthesis

    Get PDF
    Modern System-on-Chip (SoC) architectures are increasingly composed of Intellectual Property (IP) blocks, usually designed and provided by different vendors. This burdens system designers with complex system-level integration and verification. In this paper, we propose an approach that leverages HLS techniques to automatically find bugs in designs composed of multiple IP blocks. Our method is particularly suitable for industrial adoption because it works without exposing sensitive information (e.g., the design specification or the component generation process). This advocates the definition and the adoption of an interoperable format for cross-vendor hardware bug detection

    Performance Estimation of Task Graphs Based on Path Profiling

    Get PDF
    Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations or mapping and scheduling solutions. Unfortunately, especially in case of control-dominated applications, task correlations may heavily affect the execution time of the solutions and usually this is not properly taken into account during performance analysis. We propose a methodology that combines a single profiling of the initial sequential specification with different decisions in terms of partitioning, mapping, and scheduling in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an absolute error less than 5 % in average, even when compiling the code with different optimization levels

    Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization

    Get PDF
    Optimizing data movements is becoming one of the biggest challenges in heterogeneous computing to cope with data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient in optimizing the computational aspects, but data transfers have not been adequately improved. To combat this, novel architectures such as High-Bandwidth Memory with wider data busses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the arrays are not power-of-two length, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve such high efficiency.Comment: Accepted for presentation at ASPDAC'2

    Bridging the Gap between Software and Hardware Designers Using High-Level Synthesis

    Get PDF
    Modern Systems-on-Chip (SoC) architectures and CPU+FPGA computing platforms are moving towards heterogeneous systems featuring an increasing number of hardware accelerators. These specialized components can deliver energy-efficient high performance, but their design from high-level specifications is usually very complex. Therefore, it is crucial to understand how to design and optimize such components to implement the desired functionality. This paper discusses the challenges between software programmers and hardware designers, focusing on the state-of-the-art methods based on high-level synthesis (HLS). It also highlights the future research lines for simplifying the creation of complex accelerator-based architectures

    Dataflow Computing with Polymorphic Registers

    Get PDF
    Heterogeneous systems are becoming increasingly popular for data processing. They improve performance of simple kernels applied to large amounts of data. However, sequential data loads may have negative impact. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high speed, parallel access to performance-critical data. Furthermore, by PRF customization, specific data path features are exposed to the programmer in a very convenient way. PRFs allow additional control over the registers dimensions, and the number of elements which can be simultaneously accessed by computational units. This paper shows how PRFs can be integrated in dataflow computational platforms. In particular, starting from an annotated source code, we present a compiler-based methodology that automatically generates the customized PRFs and the enhanced computational kernels that efficiently exploit them

    The Case for Polymorphic Registers in Dataflow Computing

    Get PDF
    Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9×9 or larger mask sizes, even in bandwidth-constrained systems

    ASSURE: RTL Locking Against an Untrusted Foundry

    Get PDF
    Semiconductor design companies are integrating proprietary intellectual property (IP) blocks to build custom integrated circuits (IC) and fabricate them in a third-party foundry. Unauthorized IC copies cost these companies billions of dollars annually. While several methods have been proposed for hardware IP obfuscation, they operate on the gate-level netlist, i.e., after the synthesis tools embed the semantic information into the netlist. We propose ASSURE to protect hardware IP modules operating on the register-transfer level (RTL) description. The RTL approach has three advantages: (i) it allows designers to obfuscate IP cores generated with many different methods (e.g., hardware generators, high-level synthesis tools, and pre-existing IPs). (ii) it obfuscates the semantics of an IC before logic synthesis; (iii) it does not require modifications to EDA flows. We perform a cost and security assessment of ASSURE.Comment: Submitted to IEEE Transactions on VLSI Systems on 11-Oct-2020, 28-Jan-202

    Optimizing the Use of Behavioral Locking for High-Level Synthesis

    Full text link
    The globalization of the electronics supply chain requires effective methods to thwart reverse engineering and IP theft. Logic locking is a promising solution, but there are many open concerns. First, even when applied at a higher level of abstraction, locking may result in significant overhead without improving the security metric. Second, optimizing a security metric is application-dependent and designers must evaluate and compare alternative solutions. We propose a meta-framework to optimize the use of behavioral locking during the high-level synthesis (HLS) of IP cores. Our method operates on chip's specification (before HLS) and it is compatible with all HLS tools, complementing industrial EDA flows. Our meta-framework supports different strategies to explore the design space and to select points to be locked automatically. We evaluated our method on the optimization of differential entropy, achieving better results than random or topological locking: 1) we always identify a valid solution that optimizes the security metric, while topological and random locking can generate unfeasible solutions; 2) we minimize the number of bits used for locking up to more than 90% (requiring smaller tamper-proof memories); 3) we make better use of hardware resources since we obtain similar overheads but with higher security metric.Comment: Accepted for publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and System
    • …
    corecore